SNOW-3485471: cte dedup for self-join case by sfc-gh-aling · Pull Request #4245 · snowflakedb/snowpark-python

sfc-gh-aling · 2026-05-29T17:43:14Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

In SCOS, spark.table() translates into SnowflakePlan nodes with non-empty expr_to_alias. The TPCDS_Q39 query aliases the same inventory DataFrame as inv1 and inv2 and self-joins them. The join path
deep-copies the plan for each branch and calls add_aliases separately, resulting in both branches carrying the same alias values under different UUID keys. The old key-sort hashing produced
different hashes for the two branches, so the CTE optimizer did not detect them as duplicates and inlined the full aggregation pipeline twice. The value-sort fix correctly identifies them as
the same computation and emits a single CTE.

Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
- If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
Please describe how your code solves the related issue.

hash node by expr_to_alias value instead expr_to_alias items
conflicting alias detection to avoid over optimization

codecov-commenter · 2026-05-29T18:23:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.39%. Comparing base (495acce) to head (c29f910).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4245   +/-   ##
=======================================
  Coverage   95.38%   95.39%           
=======================================
  Files         171      171           
  Lines       44243    44252    +9     
  Branches     7557     7559    +2     
=======================================
+ Hits        42203    42215   +12     
+ Misses       1252     1250    -2     
+ Partials      788      787    -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ation

sfc-gh-aalam · 2026-05-29T23:01:33Z

+        # Hash by alias values only, not the UUID keys, since UUID keys are regenerated on every deep-copy/re-resolve (e.g. the two
+        # branches of a self-join). This lets nodes representing the same computation hash identically, enabling CTE dedup for self-joins.
+        # NOTE: since nodes with different UUID keys can now share a CTE, _replace_duplicate_node_with_cte must merge each duplicate's
+        # UUID→alias entries into the shared CTE so parent re-resolution can resolve any UUID variant (see companion comment there).
+        # Different alias values (e.g. a "_WITH_AD_GROUP" join suffix from _disambiguate) still hash differently, preserving SNOW-2261400.
+        string = f"{string}#{sorted(set(node.expr_to_alias.values()))}"


what kind of cases will this produce same hash-keys. When it is wrong, what kind of risk are we dealing with

I assume you are asking the hash-keys for a node -- same hash-keys are produced by nodes with same sql, sql params, df aliases, and expr_to_alias values (before the change it's by expr_to_alias.items().

(hash-keys for expr_to_alias are generated from uuid, it can rarely happen that two uuid collide)

the only risk is parent column resolution as we relax the standard of repeated node identification. I introduced extra logic to detect conflicting expr_to_alias (same expr maps to two different alias) in nodes to prevent CTE in this case. thank you for bringing it up

Is there any chance of duplicate value entries here that would silently get swallowed by the sorted(set(...)) call?

Yes, duplicate alias can happen but consolidating them is harmless in our case, it's not reflected in the newly-generated CTE optimized node, and the generated sql in the hash is enough for dup node detection in the self-join case.

the expr_to_alias is mostly for internal column name resolution in the downstream compilation stage.

theoretically I think this info shall be excluded from the hash computation of a node, but right now I keep it as a defensive layer to distinguish nodes.

sfc-gh-aling · 2026-05-29T23:11:04Z

-        # When adding a lsuffix, expr alias map will be updated, so df2 and df3 are considered
-        # different and have different ids. So only df1 and df will be converted to a CTE
+        # With value-sort hashing, df1/df2/df3 now hash identically (same SQL +
+        # same alias values, different UUID keys). df2 and df3 are replaced with a
+        # shared CTE, but df1's left-join position remains inline. That gives 2
+        # CTEs (base VALUES + filtered df1) and the filter appears twice (once in
+        # the CTE body, once inline for the left-join position).
        assert (
-            count_number_of_ctes(Utils.normalize_sql(df6.queries["queries"][-1])) == 1
+            count_number_of_ctes(Utils.normalize_sql(df6.queries["queries"][-1])) == 2
        )
-        assert Utils.normalize_sql(df6.queries["queries"][-1]).count(filter_clause) == 3
+        assert Utils.normalize_sql(df6.queries["queries"][-1]).count(filter_clause) == 2


for clarity, more CTE is better for this case, see the below comparison between the generated sql:

Before (1 CTE, filter appears 3 times):

WITH CTE_values AS ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1,2),(3,4) ) SELECT * FROM ( ( SELECT "A_XXX", "B_XXX", "A" AS "A_YYY", "B" AS "B_YYY" FROM ( ( SELECT "A" AS "A_XXX", "B" AS "B_XXX" FROM CTE_values WHERE ("A" = 1) -- filter #1 (df1, inline) ) AS SNOWPARK_LEFT INNER JOIN ( SELECT "A", "B" FROM CTE_values WHERE ("A" = 1) -- filter #2 (df2, inline) ) AS SNOWPARK_RIGHT ) ) AS SNOWPARK_LEFT INNER JOIN ( SELECT "A", "B" FROM CTE_values WHERE ("A" = 1) -- filter #3 (df3, inline) ) AS SNOWPARK_RIGHT )

After (2 CTEs, filter appears 2 times):

WITH CTE_values AS ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1,2),(3,4) ), CTE_filtered AS ( -- df2 and df3 deduplicated here SELECT "A", "B" FROM CTE_values WHERE ("A" = 1) -- filter #1 (in CTE body) ) SELECT * FROM ( ( SELECT "A_XXX", "B_XXX", "A" AS "A_YYY", "B" AS "B_YYY" FROM ( ( SELECT "A" AS "A_XXX", "B" AS "B_XXX" FROM CTE_values WHERE ("A" = 1) -- filter #2 (df1, still inline) ) AS SNOWPARK_LEFT INNER JOIN (SELECT * FROM CTE_filtered) AS SNOWPARK_RIGHT ) ) AS SNOWPARK_LEFT INNER JOIN (SELECT * FROM CTE_filtered) AS SNOWPARK_RIGHT )

sfc-gh-joshi · 2026-06-02T23:14:23Z

+        # Hash by alias values only, not the UUID keys, since UUID keys are regenerated on every deep-copy/re-resolve (e.g. the two
+        # branches of a self-join). This lets nodes representing the same computation hash identically, enabling CTE dedup for self-joins.
+        # NOTE: since nodes with different UUID keys can now share a CTE, _replace_duplicate_node_with_cte must merge each duplicate's
+        # UUID→alias entries into the shared CTE so parent re-resolution can resolve any UUID variant (see companion comment there).
+        # Different alias values (e.g. a "_WITH_AD_GROUP" join suffix from _disambiguate) still hash differently, preserving SNOW-2261400.
+        string = f"{string}#{sorted(set(node.expr_to_alias.values()))}"


Is there any chance of duplicate value entries here that would silently get swallowed by the sorted(set(...)) call?

…ation

poc fix

bf8dbd0

fix

dd89363

sfc-gh-aling changed the title ~~poc fix~~ SNOW-3485471: self-join not pushed into cte May 29, 2026

sfc-gh-aling changed the title ~~SNOW-3485471: self-join not pushed into cte~~ SNOW-3485471: optimize cte for self-join case May 29, 2026

Merge remote-tracking branch 'origin/main' into aling-poc-cte-optimiz…

556a19e

…ation

sfc-gh-aling requested a review from sfc-gh-aalam May 29, 2026 22:06

sfc-gh-aling changed the title ~~SNOW-3485471: optimize cte for self-join case~~ SNOW-3485471: cte dedup for self-join case May 29, 2026

update changelog

1580af4

sfc-gh-aling marked this pull request as ready for review May 29, 2026 22:51

sfc-gh-aling requested review from a team as code owners May 29, 2026 22:51

sfc-gh-aling requested review from sfc-gh-bkogan and sfc-gh-jrose May 29, 2026 22:51

sfc-gh-aalam reviewed May 29, 2026

View reviewed changes

sfc-gh-aling commented May 29, 2026

View reviewed changes

sfc-gh-aling added 4 commits June 1, 2026 17:56

merge

79d095b

PR comment

a726afe

merge

16dbb7e

update

4b15904

sfc-gh-aling requested a review from sfc-gh-aalam June 2, 2026 17:56

sfc-gh-aalam approved these changes Jun 2, 2026

View reviewed changes

sfc-gh-joshi reviewed Jun 2, 2026

View reviewed changes

sfc-gh-aling requested a review from sfc-gh-joshi June 3, 2026 18:17

sfc-gh-joshi approved these changes Jun 3, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into aling-poc-cte-optimiz…

5172f7f

…ation

sfc-gh-aling force-pushed the aling-poc-cte-optimization branch from d854f34 to 5172f7f Compare June 3, 2026 19:11

minor changelog update

c29f910

sfc-gh-aling merged commit dc4deee into main Jun 3, 2026
28 of 29 checks passed

sfc-gh-aling deleted the aling-poc-cte-optimization branch June 3, 2026 20:56

github-actions Bot locked and limited conversation to collaborators Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-3485471: cte dedup for self-join case#4245

SNOW-3485471: cte dedup for self-join case#4245
sfc-gh-aling merged 10 commits into
mainfrom
aling-poc-cte-optimization

sfc-gh-aling commented May 29, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 29, 2026 •

edited

Loading

Uh oh!

sfc-gh-aalam May 29, 2026

Uh oh!

sfc-gh-aling Jun 2, 2026

Uh oh!

sfc-gh-joshi Jun 2, 2026

Uh oh!

sfc-gh-aling Jun 3, 2026

Uh oh!

sfc-gh-aling May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

sfc-gh-joshi Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

sfc-gh-aling commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sfc-gh-aalam May 29, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-aling Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-joshi Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-aling Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-aling May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sfc-gh-joshi Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sfc-gh-aling commented May 29, 2026 •

edited

Loading

codecov-commenter commented May 29, 2026 •

edited

Loading

sfc-gh-aling May 29, 2026 •

edited

Loading